Abstract

We focus on the problem of query rewriting for sponsored search. We base rewrites on a historical click
graph that records the ads that have been clicked on in response to past user queries. Given a query q, we
first consider Simrank [5] as a way to identify queries similar to q, i.e., queries whose ads a user may be
interested in. We argue that Simrank fails to properly identify query similarities in our application, and we
present two enhanced versions of Simrank: one that exploits weights on click graph edges and another that
exploits "evidence." We experimentally evaluate our new schemes against Simrank, using actual click graphs
and queries from Yahoo!, and using a variety of metrics. Our results show that the enhanced methods can
yield more and better query rewrites.

1 Introduction 

In sponsored search, paid advertisements (ads) relevant to a user's query are shown above or alongside traditional
web search results. The placement of these ads is in general related to a ranking score which is a function of the
semantic relevance to the query and the advertiser's bid.

Figure 1: General sponsored search system architecture.

Ideally, a sponsored search system would appear as in Figure 1. The system has access to a database of
available ads and a set of bids. Conceptually, each bid consists of a query q, an ad a, and a price p. With such 
a bid, the bidder offers to pay if the ad a is both displayed and clicked when a user issues query q. For many 
queries, there are not enough direct bids, so the sponsored search system attempts to find other ads that may be 
of interest to the user who submitted the query. Even though there is no direct bid, if the user clicks on one of 
these ads, the search engine will make some money (and the advertiser will receive a customer). The challenge is 
then to find ads related to incoming queries that may yield user click throughs. 

For a variety of practical and historical reasons, the sponsored search system is often split into two components, 
as shown in Figure 2. A front-end takes an input query q and produces a list of rewrites, i.e., of other queries
that are "similar" to q. For example, for query "camera," the queries "digital camera" and "photography" may 
be useful because the user may also be interested in ads for those related queries. The query "battery" may also 
be useful because users that want a camera may also be in the market for a spare battery. The query and its 
rewrites are then considered by the back-end, which displays ads that have bids for the query or its rewrites. The
split approach reduces the complexity of the back-end, which has to deal with rapidly changing bids. The work
of finding relevant ads, indirectly through related queries, is off-loaded to the front-end.

Figure 2: A common sponsored search system architecture.

At the front-end, queries can be rewritten using a variety of techniques (reviewed in our Related Work section) 
developed for document search. However, these techniques often do not generate enough useful rewrites. Part 
of the problem is that in our case "documents" (the ads) have little text, and queries are very short, so there is 
less information to work with, as compared with larger documents. Another part of the problem is that there 
are relatively few queries in the bid database, so even if we found all the textually related ones, we may not have 
enough. Thus, it is important to generate additional rewrites, using other techniques. 

In this paper we focus on query rewrites based on the recent history of ads displayed and clicked on. The 
back-end generates a historical click graph that records the clicks that were generated by ads when a user inputs a 
given query. The click graph is a weighted bi-partite graph, with queries on one side and ads on the other (details 
in Section 2). The schemes we present analyze the connections in the click graph to identify rewrites that may
be useful. Our techniques identify not only queries that are directly connected by an ad (e.g., users that submit
either "mp3" or "i-tunes" click on an ad for "iPod") but also queries that are more indirectly related (Section
3). Our techniques are based on the notion of SimRank [5], which can compute query similarity based on the
connections in a bi-partite click-graph. However, in our case we need to extend SimRank to take into account the 
specifics of our sponsored search application. 

Briefly, the contributions of this paper are as follows. 

• We present a framework for query rewriting in a sponsored search environment. 

• We identify cases where SimRank fails to transfer correctly the relationships between queries and ads into 
similarity scores. 

• We present two SimRank extensions: one that takes into account the weights of the edges in the click graph, 
and another that takes into account the "evidence" supporting the similarity between queries. 

• We experimentally evaluate these query rewriting techniques, using an actual click graph from Yahoo!, and 
a set of queries extracted from Yahoo! logs. We evaluate the resulting rewrites using several metrics. One 
of the comparisons we perform involves manual evaluation of query-rewrite pairs by members of Yahoo!'s
Editorial Evaluation Team. Our results show that we can significantly increase the number of useful rewrites 
over those produced by SimRank and by another basic technique. 

1.1 Related Work The query rewriting problem has been extensively studied in terms of traditional web 
search. In traditional web search, query rewriting techniques are used for recommending more useful queries to 
the user and for improving the quality of search results by incorporating users' actions in the results' ranking 
of future searches. Given a query and a search engine's results for it, the indication that a user clicked on
some results can be interpreted as a vote that these specific results match the user's needs and thus are
more relevant to the query. This information can then be used for improving the search results on future queries.
Existing query rewriting techniques for traditional web search include relevance feedback and pseudo-relevance
feedback, query term deletion [6], substituting query terms with related terms from retrieved documents [11],
dimensionality reduction such as Latent Semantic Indexing (LSI) [4], machine learning techniques [13] [12] [2] and
techniques based on the analysis of the click graph [3].

Pseudo-relevance feedback techniques involve submitting a query for an initial retrieval, processing the 
resulting documents, modifying the query by expanding it with additional terms from the documents retrieved 
and then performing an additional retrieval on the modified query. However, pseudo-relevance feedback requires 
that the initial query retrieval procedure returns some results, something that is not always the case in sponsored 
search, as described before. In addition, pseudo-relevance feedback has many limitations in effectiveness [5]. It may lead
to query drift, as unrelated terms might be added to the query, and it is also computationally expensive. Query
relaxation or deleting query terms leads to a loss of specificity from the original query.

In LSI, a collection of queries is represented by a terms-by-queries matrix where each column corresponds to
the vector space representation of a query. The column space of that matrix is approximated by a space of much
smaller dimension that is obtained from the leading singular vectors of the matrix, and then similarity scores
between different queries can be computed. LSI is frequently found to be very effective even though the analysis
of its success is not as straightforward [8]. The computational kernel in LSI is the singular value decomposition
(SVD). This provides the mechanism for projecting the queries onto a lower-dimensional space spanned by the
leading left singular vectors. In addition to performing dimensionality reduction, LSI captures hidden semantic
structure in the data and resolves problems caused by synonymy and polysemy in the terms used. However, a 
well known difficulty with LSI is the high cost of the SVD for the large, sparse matrices appearing in practice. 

2 Problem Definition 

Let Q denote a set of n queries and A denote a set of m ads. A click graph for a specific time period is an 
undirected, weighted, bipartite graph G = (Q,A,E) where E is a set of edges that connect queries with ads. 
G has an edge (q, a) if at least one user that issued the query q during the time period also clicked on the ad 
a. Each edge (q, a) has three weights associated with it. The first one is the number of times that a has been 
displayed as a result for q and is called the impressions of a given q. The second weight is the number of clicks 
that a received as a result of being displayed for the query q. This second weight is less than or equal to the
first weight. The number of clicks divided by the number of impressions gives us the likelihood that a displayed
ad will be clicked on. However, to be more accurate, this ratio needs to be adjusted to take into account the
position where the ad was displayed. That is, an ad a placed near the top of the sponsored results is more likely
to be clicked on, regardless of how good an ad it is for query q. Thus, the third weight associated with an edge
(q, a) is the expected click rate, an adjusted clicks over impressions rate. The expected click rate is computed by
the back-end (Figure 2), and we do not discuss the details here.

Finally, for a node v in a graph, we denote by E(v) the set of neighbors of v. We also define N(v) = |E(v)|,
that is, N(v) denotes the total number of v's neighbors.
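
The definitions above map naturally onto a small data structure. The following Python sketch is only an illustration: the names `ClickGraph` and `EdgeWeights`, the field names, and the sample numbers are assumptions made for this example and do not describe the actual back-end.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeWeights:
    impressions: int            # times the ad was displayed for the query
    clicks: int                 # clicks received; never more than impressions
    expected_click_rate: float  # position-adjusted rate, supplied by the back-end


class ClickGraph:
    """Undirected, weighted, bipartite graph G = (Q, A, E)."""

    def __init__(self):
        self.edges = {}                     # (query, ad) -> EdgeWeights
        self.query_ads = defaultdict(set)   # E(q) for each query q
        self.ad_queries = defaultdict(set)  # E(a) for each ad a

    def add_edge(self, query, ad, weights):
        # An edge exists only if the ad was clicked at least once for the query.
        if not 0 < weights.clicks <= weights.impressions:
            raise ValueError("need at least one click and clicks <= impressions")
        self.edges[(query, ad)] = weights
        self.query_ads[query].add(ad)
        self.ad_queries[ad].add(query)

    def neighbors(self, node):
        """E(v): the set of neighbors of a query or an ad."""
        if node in self.query_ads:
            return self.query_ads[node]
        return self.ad_queries.get(node, set())

    def degree(self, node):
        """N(v) = |E(v)|."""
        return len(self.neighbors(node))


# Hypothetical edge weights, chosen only to exercise the structure.
graph = ClickGraph()
graph.add_edge("camera", "hp.com", EdgeWeights(1000, 40, 0.05))
graph.add_edge("camera", "bestbuy.com", EdgeWeights(800, 16, 0.03))
graph.add_edge("pc", "hp.com", EdgeWeights(500, 10, 0.02))
```

The sketch assumes that query strings and ad identifiers never collide, so a single `neighbors` lookup can serve both node types.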

As discussed in the introduction, our goal is to find queries that are similar, in the sense that the ads clicked 
on for one query are likely to be clicked on when displayed for a user that entered the second query. We will 
predict similarity based on the information in the click graph: The intuition is that if an ad received clicks when 
displayed for both queries q_1 and q_2, then the queries are similar. Furthermore, if q_2 is related to q_3 in the same
way but through some other ad, then q_1 and q_3 are also similar, although possibly to a lesser degree. We discuss
our notion of similarity more in the following section.

Note that if the click graph does not contain an ad a that received clicks when q_1 and q_2 were issued, then
we cannot infer that q_1 and q_2 are not similar. The queries could very well be similar (in our sense), but while
the click-graph was collected, the back-end did not display ads that would have shown this similarity. (Perhaps
there were no bids for those ads at the time.) As we will see later, even without the common ad a, we may still
be able to discover the similarity of q_1 and q_2 through other similarity relationships in the click-graph.

Also note that in this paper we are not addressing problems of click or ad fraud. Fraud is a serious problem, 
where organizations or individuals generate clicks or place ads with the intent of defrauding or misleading the 
advertiser and/or the search engine. Query rewriting strategies may need to be adjusted to protect from fraud, 
but we do not consider such issues here. 

Finally, notice that our query rewriting problem is a type of collaborative filtering (CF) problem. We can 
view the queries as "users" who are recommending "ads" by clicking on them. When we identify similar queries, 
we are finding queries that have similar recommendations, just like in CF, where one finds users that have similar 
tastes. In our setting, we are only trying to find similar queries (users), and not actually predicting recommended 
ads. Furthermore, as we will see, we are tuning our similarity metrics so they work well for sponsored search, as
opposed to generic recommendations. 

3 Similar queries 

In this section we discuss the notion of query similarity that we are interested in. As we mentioned earlier, we 
will be saying that two queries are similar if they tend to lead search engine users to click on the same ads.
Let us illustrate this with an example. Figure 3 shows a small click graph; for simplicity we have removed the
weights from the edges and thus an edge indicates the existence of at least one click from a query to an ad. In this 
graph, the queries "pc" and "camera" are connected through a common ad and thus can be considered similar. 
Notice that this notion of similarity is not related to the actual similarity of the concepts described by the query 
terms. Now, we can observe that the queries "camera" and "digital camera" are connected through two common 
ads and thus can be considered similar. In contrast, queries "pc" and "tv" are not connected through any ad. 
However, both "pc" and "tv" are connected through an ad with the queries "digital camera" and "camera", which
we already saw are similar. Thus, we have a small amount of evidence that "pc" and "tv" are somehow
similar, because they are both similar with queries that bring clicks to the same ads. In that case we will be 
saying that "pc" and "tv" are one hop away from queries that have a common ad. There might actually be cases 
where two queries will be two or more hops away from queries that bring clicks to the same ad. Finally, let us 
consider the queries "tv" and "flower" . There is no path in the click graph that connects these two queries and 
thus we conclude that these queries are not similar. 




Figure 3: Sample unweighted click graph. An edge indicates the existence of at least one click from a query to
an ad.

Thus, a naive way to measure the similarity of a pair of queries would be to count the number of common 
ads that they are connected to. Table 1 presents the resulting similarity scores for our sample click graph. As
we can see there, "pc" has a similarity score 1 both with "camera" and "digital camera" but no similarity with
"tv" and "flower". However, "camera" has a similarity score 2 with "digital camera", which indicates a stronger
similarity. Also, "tv" has similarity 0 both with "pc" and "flower". Notice also that "flower" has similarity 0 with
all the other queries. It is obvious that this naive technique cannot capture the similarity between "pc" and "tv"
(as it does not look at the whole graph structure) and determines that their similarity is zero. In the following
section we will see how we can compute similarity scores that take into account all the interactions appearing in 
the graph. 
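
The naive common-ad count can be sketched in a few lines of Python. The adjacency below is an assumption: it is reconstructed from the description of the sample click graph and the scores of Table 1, since the figure itself is not reproduced here. The names `SAMPLE_CLICK_GRAPH` and `common_ad_counts` are illustrative.

```python
from itertools import combinations

# Ads clicked for each query (reconstructed from the text; an assumption).
SAMPLE_CLICK_GRAPH = {
    "pc": {"hp.com"},
    "camera": {"hp.com", "bestbuy.com"},
    "digital camera": {"hp.com", "bestbuy.com"},
    "tv": {"bestbuy.com"},
    "flower": {"teleflora.com", "orchids.com"},
}


def common_ad_counts(query_ads):
    """Naive similarity: number of ads shared by each pair of distinct queries."""
    return {
        frozenset((q1, q2)): len(query_ads[q1] & query_ads[q2])
        for q1, q2 in combinations(query_ads, 2)
    }


naive = common_ad_counts(SAMPLE_CLICK_GRAPH)
```

Looking up `naive[frozenset(("pc", "tv"))]` gives zero under this reconstruction, which is exactly the shortcoming discussed above: the count only sees direct neighbors.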

4 Simrank-based query similarity 

Simrank [5] is a method for computing object similarities, applicable in any domain with object-to-object 
relationships, that measures similarity of the structural context in which objects occur, based on their relationships 
with other objects. Specifically, in the case where there are two types of objects, bipartite Simrank is an iterative
technique to compute the similarity score for each pair of objects of the same type. Bipartite Simrank is based
on the underlying idea that two objects of one type are similar if they are related to similar objects of the second
type. In our case, we can consider the queries as one type of objects and the ads as the other and use bipartite
Simrank to compute similarity scores for each query-query pair.

Table 1: Query-query similarity scores for the sample click graph of Figure 3. Scores have been computed by
counting the common ads between the queries.

                  pc   camera   digital camera   tv   flower
pc                -    1        1                0    0
camera            1    -        2                1    0
digital camera    1    2        -                1    0
tv                0    1        1                -    0
flower            0    0        0                0    -

Let s(q, q') denote the similarity between queries q and q', and let s(a, a') denote the similarity between ads
a and a'. For q ≠ q', we write the equation:

(4.1) s(q, q') = \frac{C_1}{N(q) N(q')} \sum_{i \in E(q)} \sum_{j \in E(q')} s(i, j)

where C_1 is a constant between 0 and 1. For a ≠ a', we write:

(4.2) s(a, a') = \frac{C_2}{N(a) N(a')} \sum_{i \in E(a)} \sum_{j \in E(a')} s(i, j)

where again C_2 is a constant between 0 and 1.

If q = q', we define s(q, q') = 1 and analogously if a = a' we define s(a, a') = 1. Neglecting C_1 and C_2,
equation (4.1) says that the similarity between queries q and q' is the average similarity between the ads that were
clicked on for q and q'. Similarly, equation (4.2) says that the similarity between ads a and a' is the average
similarity between the queries that triggered clicks on a and a'.

In the SimRank paper [5], it is shown that a simultaneous solution s(·,·) ∈ [0, 1] to the above equations
always exists and is unique. Also notice that the SimRank scores are symmetric, i.e. s(q, q') = s(q', q).

In order to understand the role of the C_1, C_2 constants, let us consider a simple scenario where two ads a and
a' were clicked on for a query q (which means that edges from q towards a and a' exist), so we can conclude
some similarity between a and a'. The similarity of q with itself is 1, but we probably don't want to conclude
that s(a, a') = s(q, q) = 1. Rather, we let s(a, a') = C_2 · s(q, q), meaning that we are less confident about the
similarity between a and a' than we are between q and itself.

Let us look now at the similarity scores that Simrank computes for our simple click graph of Figure 3. Table
2 presents the similarity scores between all query pairs. If we compare these similarity scores with the ones in
Table 1, we can make the following observations. Firstly, "camera" and "digital camera" now have the same
similarity score with all other queries except for "flower". Secondly, "tv" has similarity 0.437 with "pc", 0.619
with "camera" and "digital camera", and zero with "flower". Notice that Simrank takes into account the whole
graph structure and thus correctly produces a nonzero similarity score for the pair "tv" - "pc". Also notice that
"camera" has two common ads with "digital camera" and only one common ad with "tv". However, Simrank
does not produce different similarity scores for the "camera" - "digital camera" and "camera" - "tv" pairs. We will
come back to this issue in detail in Section 6.
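
The iteration behind these scores can be sketched as follows. This is a minimal illustration of bipartite Simrank rather than the implementation used in the experiments; the sample adjacency is an assumption reconstructed from the description of the sample click graph, and the function and variable names are illustrative.

```python
from itertools import combinations

# Ads clicked for each query (reconstructed from the text; an assumption).
SAMPLE_CLICK_GRAPH = {
    "pc": {"hp.com"},
    "camera": {"hp.com", "bestbuy.com"},
    "digital camera": {"hp.com", "bestbuy.com"},
    "tv": {"bestbuy.com"},
    "flower": {"teleflora.com", "orchids.com"},
}


def bipartite_simrank(query_ads, c1=0.8, c2=0.8, iterations=50):
    """Iterate the bipartite Simrank equations; return (query_sims, ad_sims).

    Scores are keyed by frozenset({x, y}); s(x, x) = 1 is implicit.
    """
    ad_queries = {}
    for query, ads in query_ads.items():
        for ad in ads:
            ad_queries.setdefault(ad, set()).add(query)

    def lookup(scores, x, y):
        return 1.0 if x == y else scores.get(frozenset((x, y)), 0.0)

    def update(neighbors, other_scores, decay):
        # Decayed average similarity over all pairs of neighbors of x and y.
        new_scores = {}
        for x, y in combinations(neighbors, 2):
            total = sum(lookup(other_scores, i, j)
                        for i in neighbors[x] for j in neighbors[y])
            new_scores[frozenset((x, y))] = (
                decay * total / (len(neighbors[x]) * len(neighbors[y])))
        return new_scores

    query_sims, ad_sims = {}, {}
    for _ in range(iterations):
        # Both sides are updated from the scores of the previous iteration.
        query_sims, ad_sims = (update(query_ads, ad_sims, c1),
                               update(ad_queries, query_sims, c2))
    return query_sims, ad_sims


query_sims, ad_sims = bipartite_simrank(SAMPLE_CLICK_GRAPH)
```

With C_1 = C_2 = 0.8 and enough iterations, the score for "tv" - "pc" under this reconstruction should come out close to the 0.437 quoted above, while "camera" - "tv" and "camera" - "digital camera" receive the same score.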

5 Random walks behind Simrank 

The intuition behind the similarity scores that Simrank defines is based on a "random surfers" model. According 
to this, a Simrank score sim(a, b) measures how soon two random surfers are expected to meet at the same node 
if they started at nodes a, b and randomly walked the graph. The transition probabilities of this random walk 
are uniform, which means that (assuming C_1 = C_2 = 1) if a has n out-neighbors, with the same probability 1/n
the random surfer will move to one of these out-neighbors.

Table 2: Query-query similarity scores for the sample click graph of Figure 3. Scores have been computed by
Simrank with C_1 = C_2 = 0.8

The decay factors C_1, C_2 allow for self-transitions. Self-transitions correspond to transitions from a node to
itself. C_1 affects the self-transition probabilities of one of the graph's node sets while C_2 affects the self-transition
probabilities of the other node set. Given that C_1 < 1, with probability 1 - C_1 a random surfer will remain in
the same node and with probability C_1/n he will move to one of the n out-neighbors of the node.

6 Simrank in complete bipartite graphs 

Some simple bipartite graphs that often appear as subgraphs of a click graph are the complete bipartite graphs. A
complete bipartite graph is a special kind of bipartite graph where every vertex of the first node set is connected to
every vertex of the second node set. In the click graph of Figure 3, the subgraphs consisting of the nodes "flower",
"Teleflora.com", "orchids.com" and "camera", "digital camera", "hp.com", "bestbuy.com" are two examples of
complete bipartite subgraphs. Formally, a complete bipartite graph G = (V_1, V_2, E) is a bipartite graph such that
for any two vertices v_1 ∈ V_1 and v_2 ∈ V_2, (v_1, v_2) is an edge in E. The complete bipartite graph with partitions
of size |V_1| = m and |V_2| = n is denoted K_{m,n}. Figure 4(a) shows a K_{2,2} graph from a click graph and Figure
4(b) shows a K_{1,2} click graph.



Figure 4: Sample complete bipartite graphs (K_{2,2} and K_{1,2}) extracted from a click graph. (a) Queries "camera"
and "digital camera" with ads Hp.com and Bestbuy.com; (b) queries "pc" and "camera" with ad Hp.com.



Let us look at the similarity scores that Simrank computes for the pairs "camera" - "digital camera" and
"pc" - "camera" from the graphs of Figure 4. Table 3 tabulates these scores for the first 7 iterations. As
we can see, sim("camera", "digital camera") is always less than sim("pc", "camera"), although we observe that
sim("camera", "digital camera") increases as we include more iterations. In fact, we can prove that sim("camera",
"digital camera") eventually becomes equal to sim("pc", "camera") as we include more iterations, provided that
C_1 = C_2 = 1. We can actually prove the following two Theorems for the similarity scores that Simrank computes
in complete bipartite graphs (refer to the Appendix for the proofs).

Theorem 6.1. Consider the two complete bipartite graphs G = K_{1,2} and G' = K_{2,2} with node sets V_1 =
{a}, V_2 = {A, B} and V_1' = {b, c}, V_2' = {C, D} respectively. Let sim^{(k)}(A, B) and sim^{(k)}(C, D) denote the
similarity scores that bipartite Simrank computes for the node pairs (A, B) and (C, D) after k iterations. Then,
sim^{(k)}(A, B) > sim^{(k)}(C, D), ∀k > 0.

Theorem 6.2. Consider the two complete bipartite graphs G = K_{m,2} and G' = K_{n,2} with m < n and node sets
V_1, V_2 = {A, B} and V_1', V_2' = {C, D} respectively. Let sim^{(k)}(A, B) and sim^{(k)}(C, D) denote the similarity
scores that bipartite Simrank computes for the node pairs (A, B) and (C, D) after k iterations. Then,

(i) sim^{(k)}(A, B) > sim^{(k)}(C, D), ∀k > 0, and

(ii) \lim_{k \to \infty} sim^{(k)}(A, B) = \lim_{k \to \infty} sim^{(k)}(C, D) if and only if C_1 = C_2 = 1, where C_1, C_2 are the decay
factors of the bipartite Simrank equations.

Table 3: Query-query similarity scores for the sample click graphs of Figure 4. Scores have been computed by
Simrank with C_1 = C_2 = 0.8

These Theorems provide us with two pieces of evidence that Simrank scores are not intuitively correct in complete
bipartite graphs. First, as in practice Simrank computations are limited to a small number of iterations, we would
reach the conclusion that the pair "pc" - "camera" is more similar than the pair "camera" - "digital camera", which
is obviously not correct. Second, even if we had the luxury to run Simrank until it converges, we would at best
(when C_1 = C_2 = 1) reach the conclusion that the similarity scores of the two pairs are the same. However, the fact
that there are two advertisers that are connected with the queries "camera" and "digital camera" (versus the one
that connects "pc" with "camera") is an indication that their similarity is stronger. We will try to fix such cases by
introducing the notion of "evidence of similarity" in the following section.
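
The behavior stated by the two Theorems can be checked numerically. In K_{m,2} all pairs of distinct ads are interchangeable, so the Simrank equations collapse to two scalar recurrences. The sketch below relies on that symmetry argument (our own simplification, not taken from the proofs) and uses illustrative names.

```python
def simrank_complete_bipartite(m, iterations, c1=0.8, c2=0.8):
    """Simrank score of the two queries of K_{m,2} after each iteration.

    By symmetry, every pair of distinct ads shares one score y, and the
    query pair (A, B) has score x; both start at 0.
    """
    x, y = 0.0, 0.0
    history = []
    for _ in range(iterations):
        # Query update: average over the m*m ad pairs, m of which pair an ad with itself.
        new_x = c1 * (m + m * (m - 1) * y) / (m * m)
        # Ad update: average over the 4 query pairs, 2 of which pair a query with itself.
        new_y = c2 * (2 + 2 * x) / 4
        x, y = new_x, new_y
        history.append(x)
    return history


pc_camera = simrank_complete_bipartite(m=1, iterations=7)       # K_{1,2}
camera_digital = simrank_complete_bipartite(m=2, iterations=7)  # K_{2,2}
```

Comparing the two lists element by element shows the K_{1,2} pair ahead at every iteration, while the K_{2,2} score keeps growing; rerunning with `c1=c2=1.0` and many iterations brings the two limits together.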

7 Revising Simrank 

Consider a bipartite graph G = (V_1, V_2, E) and two nodes a, b ∈ V_1. We will denote as evidence(a, b) the evidence
existing in G that the nodes a, b are similar. The definition of evidence(a, b) we use is shown in Equation (7.3):

(7.3) evidence(a, b) = \sum_{i=1}^{|E(a) \cap E(b)|} \frac{1}{2^i}

The intuition behind choosing such a function is as follows. We want the evidence score evidence(a, b) to be an
increasing function of the common neighbors between a and b. In addition we want the evidence scores to get
closer to one as the common neighbors increase. Thus, another reasonable choice would be the following:

(7.4) evidence(a, b) = 1 - e^{-|E(a) \cap E(b)|}



In our experiments we used the first definition although preliminary results with both formulas did not show 
substantial differences. 
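
Both candidate evidence functions are easy to compare side by side. The sketch below assumes that the summand of the first definition is 1/2^i, so that the sum equals 1 - 2^{-n} for n common neighbors; the function names are illustrative.

```python
import math


def evidence_geometric(common_neighbors):
    """First definition: sum of 1/2^i for i = 1 .. number of common neighbors."""
    return sum(0.5 ** i for i in range(1, common_neighbors + 1))


def evidence_exponential(common_neighbors):
    """Second definition: 1 - exp(-number of common neighbors)."""
    return 1.0 - math.exp(-common_neighbors)


def evidence(neighbors_a, neighbors_b, formula=evidence_geometric):
    """Evidence that two nodes are similar, given their neighbor sets."""
    return formula(len(set(neighbors_a) & set(neighbors_b)))


# Two queries that share both of their ads (hypothetical neighbor sets).
shared_two = evidence({"hp.com", "bestbuy.com"}, {"hp.com", "bestbuy.com"})
```

Both functions are zero for nodes with no common neighbor, increase with every additional common neighbor, and stay below one.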

We can now incorporate the evidence metric into the Simrank equations. We modify the equations (4.1) and
(4.2) as follows:

For q ≠ q', we write the equation:

(7.5) s_{evidence}(q, q') = evidence(q, q') \cdot s(q, q')

where s(q, q') is the Simrank similarity between q and q'. For a ≠ a', we write:

(7.6) s_{evidence}(a, a') = evidence(a, a') \cdot s(a, a')

where again s(a, a') is the Simrank similarity between a and a'.

Notice that we could use only k iterations to compute the Simrank similarity scores and then multiply them
by the evidence scores to come up with evidence-based similarities after k iterations. We will be loosely referring to
these scores as evidence-based similarity scores after k iterations and we will be denoting them by s_{evidence}(q, q').

Table 4: Query-query similarity scores for the sample click graphs of Figure 4. Scores have been computed by
the evidence-based Simrank with C_1 = C_2 = 0.8

Let us see now what the new Simrank equations compute for our sample click graphs. Table 4 tabulates
these scores. As we can see, sim("camera", "digital camera") is greater than sim("pc", "camera") after the first
iteration. We can actually prove the following Theorem for the similarity scores that evidence-based Simrank
computes in complete bipartite graphs (refer to the Appendix for the proof).

Theorem 7.1. Consider the two complete bipartite graphs G = K_{m,2} and G' = K_{n,2} with m < n and node sets
V_1, V_2 = {A, B} and V_1', V_2' = {C, D} respectively. Let sim^{(k)}(A, B) and sim^{(k)}(C, D) denote the similarity
scores that bipartite evidence-based Simrank computes for the node pairs (A, B) and (C, D) after k iterations and
let C_1, C_2 > 1/2, where C_1, C_2 are the decay factors of the bipartite Simrank equations. Then,

(i) sim^{(k)}(A, B) < sim^{(k)}(C, D), ∀k > 1, and

(ii) \lim_{k \to \infty} sim^{(k)}(A, B) < \lim_{k \to \infty} sim^{(k)}(C, D).

This Theorem indicates that the evidence-based Simrank scores in complete bipartite graphs will be consistent
with the intuition of query similarity (as we discussed it in Section 3) even if we effectively limit the number of
iterations we perform.
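
The effect of the evidence factor on the two sample graphs can be illustrated numerically. The sketch below multiplies plain Simrank scores on K_{m,2} by the evidence of the query pair; it assumes the geometric evidence definition in its closed form 1 - 2^{-m} and the same symmetry simplification for complete bipartite graphs, and its names are illustrative.

```python
def evidence_simrank_complete_bipartite(m, iterations, c1=0.8, c2=0.8):
    """Evidence-based score of the two queries of K_{m,2} after each iteration."""
    evidence = 1.0 - 0.5 ** m  # the two queries share all m ads
    x, y = 0.0, 0.0            # plain Simrank: query pair score, ad pair score
    history = []
    for _ in range(iterations):
        # Simultaneous update of both scores from the previous iteration.
        x, y = (c1 * (m + m * (m - 1) * y) / (m * m),
                c2 * (2 + 2 * x) / 4)
        history.append(evidence * x)
    return history


pc_camera = evidence_simrank_complete_bipartite(m=1, iterations=7)       # K_{1,2}
camera_digital = evidence_simrank_complete_bipartite(m=2, iterations=7)  # K_{2,2}
```

With C_1 = C_2 = 0.8, the K_{2,2} pair trails only at the first iteration and leads from the second iteration on, in line with the condition k > 1 of the Theorem.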

8 Weighted Simrank 

In the previous sections we ignored the information contained in the edges of a click graph and we tried to derive 
similarity scores for query pairs by just using the click graph's structure. In this section, we focus on weighted 
click graphs. We explore ways to derive query-query similarity scores that (i) are consistent with the graph's 
weights and (ii) utilize the edge weights in the computation of similarity scores. 

8.1 Consistent similarity scores We illustrate the notion of consistency between similarity scores and the
graph's weights with the following two examples. Firstly, consider the two weighted click graphs in Figure 5.
Apparently the queries "flower" - "orchids" of the left graph are more "similar" than the queries "flower" -
"teleflora" of the right graph. This is true because, although both pairs bring clicks to the same ad, the queries
of the first pair bring the same number of clicks, whereas in the second pair the numbers of clicks that the two
queries bring differ a lot. If we now try to use Simrank or even the evidence-based Simrank to compute similarity
scores for these two pairs, we will see that it outputs exactly the same similarity scores for both pairs. It is thus
obvious that Simrank scores are not consistent with the weights on the graph.

Figure 5: Sample weighted click graphs

Now, consider the two graphs of Figure 6. Apparently the similarity scores are no longer affected by the previous
notion of consistency, as in both graphs the spread of values of the right node is the same. However, it is also
obvious that now the queries "flower" - "orchids" are more similar than the queries "flower" - "teleflora" since there
are more clicks that connect the first pair with an ad. Again, Simrank or evidence-based Simrank will output
exactly the same similarity scores for both pairs.

Figure 6: Sample weighted click graphs

In general, we define the notion of consistency as follows:

Definition 8.1. (Consistent similarity scores) Consider a weighted bipartite graph G = (V_1, V_2, E).
Consider also two nodes v_1, v_2 ∈ V_2 and four nodes i_1, j_1, i_2, j_2 ∈ V_1. We now define the sets W(v_1) =
{w(i_1, v_1), w(j_1, v_1)} and W(v_2) = {w(i_2, v_2), w(j_2, v_2)} and let variance(v_1) (variance(v_2)) denote a measure
of W(v_1)'s (W(v_2)'s) variance respectively. We will be saying that a set of similarity scores sim(i, j), ∀i, j ∈
V_1, is consistent with the graph's weights if and only if ∀i_1, j_1, i_2, j_2 ∈ V_1 and ∀v_1, v_2 ∈ V_2 such that
∃(i_1, v_1), (j_1, v_1), (i_2, v_2), (j_2, v_2) ∈ E both of the following are true:

(i) If variance(v_1) = variance(v_2) and w(i_1, v_1) > w(i_2, v_2) then sim(i_1, j_1) > sim(i_2, j_2)

(ii) If variance(v_1) < variance(v_2) and w(i_1, v_1) > w(i_2, v_2) then sim(i_1, j_1) > sim(i_2, j_2)

8.2 Revising Simrank We can now modify the underlying random walk model of Simrank. Again we use
the evidence scores as defined in Section 7, but now we will perform a different random walk. Remember that
Simrank's random surfers model implies that a Simrank score sim(a, b) for two nodes a, b measures how soon
two random surfers are expected to meet at the same node if they started at nodes a, b and randomly walked the
graph. In order to impose the consistency rules on the similarity scores, we perform a new random walk whose
transition probabilities p(a, i), ∀a ∈ V_1, i ∈ E(a), are defined as follows:

p(a, i) = spread(i) \cdot normalized\_weight(a, i), \forall i \in E(a), and

p(a, a) = 1 - \sum_{i \in E(a)} p(a, i)

where:

spread(i) = e^{-variance(i)}, and

normalized\_weight(a, i) = \frac{w(a, i)}{\sum_{j \in E(a)} w(a, j)}

Notice how the new transition probability p(a, i) between two nodes a ∈ V_1, i ∈ V_2 utilizes both the spread(i)
value and the w(a, i) value in order to satisfy the consistency rules. Actually, we can prove the following Theorem
that ensures that weighted Simrank produces consistent similarity scores.

Theorem 8.1. Consider a weighted bipartite graph G = (V_1, V_2, E) and let w(e) denote the weight associated
with an edge e ∈ E. Let also sim(i, j) denote the similarity score that weighted Simrank computes for two nodes
i, j ∈ V_1. Then, ∀i, j ∈ V_1, sim(i, j) is consistent with the graph's weights.
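
The transition probabilities of the modified random walk can be sketched as follows. Two assumptions are made here and should be read as such: variance(i) is taken to be the population variance of the weights of all edges incident to i, and normalized_weight(a, i) divides w(a, i) by the total weight of the edges leaving a. The function name, argument layout, and edge weights are illustrative.

```python
import math
from statistics import pvariance


def transition_probabilities(node, weights_from, weights_into):
    """Transition probabilities of the modified random walk out of `node`.

    weights_from[a][i] is w(a, i) for every neighbor i of a;
    weights_into[i] lists the weights of all edges incident to i.
    """
    total = sum(weights_from[node].values())
    probs = {}
    for i, w in weights_from[node].items():
        spread = math.exp(-pvariance(weights_into[i]))  # spread(i)
        probs[i] = spread * w / total                   # p(a, i)
    probs[node] = 1.0 - sum(probs.values())             # self-transition p(a, a)
    return probs


# Hypothetical weights: two queries bring very different click rates to one ad.
weights_from = {"flower": {"teleflora.com": 0.9}, "teleflora": {"teleflora.com": 0.1}}
weights_into = {"teleflora.com": [0.9, 0.1]}
skewed = transition_probabilities("flower", weights_from, weights_into)

# Hypothetical weights: both queries bring the same click rate to the ad.
balanced = transition_probabilities(
    "flower",
    {"flower": {"teleflora.com": 0.5}},
    {"teleflora.com": [0.5, 0.5]},
)
```

Because every spread value lies in (0, 1] and the normalized weights of a node sum to one, the outgoing probabilities never exceed one in total, and the remainder becomes the self-transition.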




The actual similarity scores that weighted Simrank gives after applying the modified random walk are:

s_{weighted}(q, q') = evidence(q, q') \cdot C_1 \sum_{i \in E(q)} \sum_{j \in E(q')} W(q, i) W(q', j) s_{weighted}(i, j)

s_{weighted}(a, a') = evidence(a, a') \cdot C_2 \sum_{i \in E(a)} \sum_{j \in E(a')} W(a, i) W(a', j) s_{weighted}(i, j)

where the factors W(q, i) and W(a, i) are defined as follows:

W(q, i) = spread(i) \cdot normalized\_weight(q, i) = e^{-variance(i)} \frac{w(q, i)}{\sum_{j \in E(q)} w(q, j)}, and

W(a, i) = spread(i) \cdot normalized\_weight(a, i) = e^{-variance(i)} \frac{w(a, i)}{\sum_{j \in E(a)} w(a, j)}
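
A compact sketch of the weighted iteration is given below. It is an illustration under stated assumptions, not the implementation used in the experiments: variance(i) is taken as the population variance of the weights on the edges incident to i, the evidence factor uses the geometric definition in closed form, and all names and edge weights are hypothetical.

```python
import math
from itertools import combinations
from statistics import pvariance


def weighted_simrank(edges, c1=0.8, c2=0.8, iterations=5):
    """Weighted Simrank query scores for a click graph {(query, ad): weight}."""
    q_nbrs, a_nbrs = {}, {}
    for (query, ad), w in edges.items():
        q_nbrs.setdefault(query, {})[ad] = w
        a_nbrs.setdefault(ad, {})[query] = w

    def factors(nbrs, other):
        # W(u, i) = spread(i) * normalized_weight(u, i)
        return {
            (u, i): math.exp(-pvariance(list(other[i].values()))) * w / sum(ws.values())
            for u, ws in nbrs.items() for i, w in ws.items()
        }

    def evidence(nbrs, x, y):
        common = len(nbrs[x].keys() & nbrs[y].keys())
        return 1.0 - 0.5 ** common  # closed form of sum_{i=1..n} 1/2^i

    def lookup(scores, x, y):
        return 1.0 if x == y else scores.get(frozenset((x, y)), 0.0)

    def update(nbrs, w_factors, other_scores, decay):
        return {
            frozenset((x, y)): evidence(nbrs, x, y) * decay * sum(
                w_factors[(x, i)] * w_factors[(y, j)] * lookup(other_scores, i, j)
                for i in nbrs[x] for j in nbrs[y])
            for x, y in combinations(nbrs, 2)
        }

    w_q, w_a = factors(q_nbrs, a_nbrs), factors(a_nbrs, q_nbrs)
    q_sims, a_sims = {}, {}
    for _ in range(iterations):
        # Both sides are updated from the scores of the previous iteration.
        q_sims, a_sims = (update(q_nbrs, w_q, a_sims, c1),
                          update(a_nbrs, w_a, q_sims, c2))
    return q_sims


# Hypothetical weights: equal click rates versus very different click rates.
balanced = weighted_simrank({("flower", "teleflora.com"): 0.5,
                             ("orchids", "teleflora.com"): 0.5})
skewed = weighted_simrank({("flower", "teleflora.com"): 0.9,
                           ("teleflora", "teleflora.com"): 0.1})
```

In this toy comparison the pair with equal weights ends up with the higher score, because the spread factor penalizes the ad whose incoming weights differ widely.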



9 Experiments 

We conducted experiments to compare the performance of Simrank, evidence-based Simrank and weighted 
Simrank as techniques for query rewriting. Our baseline was a query rewriting technique based on the Pearson 
correlation. 

9.1 Baseline The Pearson correlation between two queries $q$ and $q'$ is defined as:

$$\mathrm{sim}_{\mathrm{pearson}}(q, q') = \frac{\sum_{a \in E(q) \cap E(q')} (w(q, a) - \bar{w}_q)(w(q', a) - \bar{w}_{q'})}{\sqrt{\sum_{a \in E(q) \cap E(q')} (w(q, a) - \bar{w}_q)^2 \, \sum_{a \in E(q) \cap E(q')} (w(q', a) - \bar{w}_{q'})^2}}$$

where $\bar{w}_q = \frac{\sum_{i \in E(q)} w(q, i)}{|E(q)|}$ is the average weight of all edges that have $q$ as an endpoint. If $E(q) \cap E(q') = \emptyset$,
then $\mathrm{sim}_{\mathrm{pearson}}(q, q') = 0$. The Pearson correlation indicates the strength of a linear relationship between two
variables. In our case, we use it to measure the relationship between two queries. Notice that $\mathrm{sim}_{\mathrm{pearson}}$ takes
values in the interval $[-1, 1]$ and that it requires the two queries $q$ and $q'$ to have at least one common neighbor in
the click graph.
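A minimal Python sketch of this baseline follows. The function name and the nested-dictionary layout are hypothetical, and returning 0 when the denominator vanishes is an assumption, since the definition above does not cover that case.

```python
import math


def pearson_similarity(w, q1, q2):
    """Pearson correlation of two queries over their common ads.

    w[q] maps each ad connected to query q to the weight of that edge.
    """
    common = set(w[q1]) & set(w[q2])
    if not common:
        return 0.0  # no common neighbor in the click graph

    # Averages are taken over all edges of each query, not only the common ones.
    mean1 = sum(w[q1].values()) / len(w[q1])
    mean2 = sum(w[q2].values()) / len(w[q2])

    num = sum((w[q1][a] - mean1) * (w[q2][a] - mean2) for a in common)
    den = math.sqrt(
        sum((w[q1][a] - mean1) ** 2 for a in common)
        * sum((w[q2][a] - mean2) ** 2 for a in common)
    )
    # Assumption: a zero denominator is reported as similarity 0.
    return num / den if den else 0.0
```

Because the sums run only over common ads, two queries with no shared ad can never receive a rewrite from this baseline.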



9.2 Dataset We started from a two-week click graph from US Yahoo! search, containing approximately 15
million distinct queries, 14 million distinct ads and 28 million edges. An edge in this graph connects a query with
an ad if and only if the ad was clicked at least once by a user who issued the query. In addition, each edge
carries the number of clicks, the number of impressions, and the expected click rate. This graph consists
of one huge connected component and several smaller subgraphs. In all experiments that required
an edge weight, we used the expected click rate.

To make the dataset size more manageable, we used the subgraph extraction method described in [1] to
further decompose the largest component, producing five smaller subgraphs. In summary, the algorithm
in [1] is an efficient local graph partitioning algorithm that uses PageRank vectors. Given a graph and an
initial node, it tries to find a cut with small conductance near that starting node. We started from different
nodes and ran the algorithm iteratively in order to discover sufficiently large, distinct subgraphs. Table 5 tabulates
the total number of nodes (queries and ads) and edges contained in the five-subgraphs dataset. We also observed
a number of power-law distributions, including ads-per-query, queries-per-ad and number of clicks per query-ad
pair. We used this dataset as the input click graph for all query rewriting techniques we experimented with.

The query set for evaluation was sampled, with uniform probability, from live traffic during the same two-week
period. This traffic contains all queries issued at Yahoo! during that period, even the ones that did not bring any
clicks on a sponsored search result. More specifically, we used a standardized 1200-query sample that was
generated by the above procedure and is currently used as a benchmark at Yahoo!. We looked at these



1 The conductance is a way to measure how hard it is to leave a small set of a graph's nodes. If $\Phi_S$ is the conditional probability
of leaving a set of nodes $S$ given that we started from a node in $S$, then the conductance is defined as the minimal $\Phi_S$ over all sets $S$
that have a total stationary probability of at most 1/2. More information can be found in [10].
Table 5: Dataset statistics

              # of Queries     # of Ads    # of Edges
subgraph 1         585,218      434,938     1,280,920
subgraph 2         530,797      374,243     1,130,314
subgraph 3         322,252      214,952       713,253
subgraph 4         313,951      243,406       703,747
subgraph 5          91,195       87,442       216,828
Total            1,843,413    1,354,981     4,045,062



1200 queries and extracted only the ones that actually appear in our five-subgraphs dataset, since only for those
can our query rewriting methods provide rewrites. There are 120 such queries, and
they constitute our evaluation set. This selection procedure ensures
that rarely issued queries have a smaller probability of appearing in the evaluation set, whereas more
popular queries appear with higher probability. We made this decision because we are interested in comparing
the query rewriting techniques on a realistic query set. In other words, we prefer a rewriting technique that
provides high-quality rewrites for popular queries over one that does the same only for rare queries.

9.3 Evaluation Method We ran each method on the five-subgraphs dataset and recorded the top 100 rewrites
for each query in our query sample. We then used stemming to filter out duplicate rewrites (notice that such
rewrites might appear in the click graph). In addition, we performed bid term filtering, i.e., we removed queries that
are not in a list of all queries that saw bids in the two-week period when the click graph was gathered. This list
contains any query that received at least one bid at any point in the period; hence, if a query is not in the list, it
is unlikely to have bids currently. (Note that such queries with no bids may still be connected to ads in the click
graph. These ads were displayed and clicked on because of query rewriting that took place when the query was
originally submitted.)

The queries that remain after duplicate elimination and bid term filtering are considered for our evaluation. 
However, we limit ourselves to at most 5 rewrites per query per method because of the cost of the manual 
evaluation we describe next. Note that a method may generate fewer than 5 rewrites after filtering. We call the 
number of remaining rewrites the depth of a method. 
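The filtering steps above can be sketched as follows. Everything named here is hypothetical: `toy_stem` is a crude stand-in for a real stemmer, the bid term list is a plain set, and two choices are assumptions of the sketch rather than statements from the text: the bid term filter is applied before duplicate elimination, and a rewrite that stems to the original query is treated as a duplicate.

```python
def toy_stem(text):
    """Crude stand-in for a real stemmer: lowercase and strip trailing 's'."""
    return " ".join(token.rstrip("s") for token in text.lower().split())


def select_rewrites(query, ranked_rewrites, bid_terms, stem=toy_stem, max_depth=5):
    """Keep at most max_depth rewrites, in rank order.

    ranked_rewrites: candidate rewrites, best first (e.g. the top 100).
    bid_terms: set of queries that received at least one bid in the period.
    """
    seen = {stem(query)}  # assumption: rewrites equal to the query are duplicates
    kept = []
    for rewrite in ranked_rewrites:
        if rewrite not in bid_terms:
            continue  # bid term filtering
        key = stem(rewrite)
        if key in seen:
            continue  # duplicate after stemming
        seen.add(key)
        kept.append(rewrite)
        if len(kept) == max_depth:
            break
    return kept
```

The length of the returned list is the depth of the method for that query; it can be smaller than `max_depth` when filtering removes most candidates.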

To evaluate the quality of rewrites, we consider two methods. The first is a manual evaluation, carried out
by professional members of Yahoo!'s editorial evaluation team. Each query-rewrite pair is considered by an
evaluator and is given a score on a scale from 1 to 4, based on the evaluator's relevance judgment. (The scoring is the
same as that used in [3, 14].) The query rewrites that were most relevant to the original query were assigned a score of
1, and the least related were assigned a score of 4. Table 6 summarizes the interpretation of the four grades, and their
description is given below.

1. Precise rewrite: The query rewrite matches the user's intent and it preserves the core meaning of the original
query.

2. Approximate rewrite: The query rewrite has a direct close relationship to the topic described by the initial 
query, but the scope has narrowed or broadened or there has been a slight shift to a closely related topic. 

3. Possible rewrite: The query rewrite either has some categorical relationship to the initial query (i.e. the 
two are in the same broad category of products or services) or describes a complementary product, but is 
otherwise distinct from the original user intent. 

4. Clear Mismatch: The query rewrite has no clear relationship to the intent of the original query. 

The judgment scores are solely based on the evaluator's knowledge, and not on the contents of the click graph. 
Our second evaluation method addresses the question of whether our methods made the "right" decision based 
on the evidence found in the click graph. The basic idea is to remove certain edges from the click graph and to 
see if using the remaining data our schemes can still make useful inferences related to the missing data. 
Figure 7: Sample setup for testing the ability of a rewriting method to compute correct query rewrites. By
removing the red, dashed edges, we remove all direct similarity evidence between $q_1$ and $q_2$, $q_3$.



Table 6: Editorial scoring system for query rewrites.

Score                        Definition                                         Example (query - rewrite)
1      Precise Match         near-certain match                                 corvette car - Chevrolet corvette
2      Approximate Match     probable, but inexact match with user intent       apple music player - ipod shuffle
3      Marginal Match        distant, but plausible match to a related topic    glasses - contact lenses
4      Mismatch              clear mismatch                                     time magazine - time & date magazine



In particular, consider Figure 7, which shows two queries $q_2$ and $q_3$ that share at least one common ad with
a query $q_1$. In order to decide which of $q_2$ and $q_3$ is the preferable rewrite for $q_1$, we define the
desirability of query $q_2$ for query $q_1$ as $\mathrm{des}(q_1, q_2) = \sum_{i \in E(q_1) \cap E(q_2)} \frac{w(q_2, i)}{|E(q_2)|}$. By computing the desirability
scores $\mathrm{des}(q_1, q_2)$ and $\mathrm{des}(q_1, q_3)$, we can determine the most desirable rewrite for $q_1$. That is, given the evidence in
the graph, if $\mathrm{des}(q_1, q_2) > \mathrm{des}(q_1, q_3)$, then $q_2$ would be a better rewrite for $q_1$ than $q_3$.

Given our definition of desirability, we can now conduct the following experiment. First, we remove the edges
that connect $q_1$ to ads that are also connected with $q_3$ or $q_2$. In Figure 7 these are the red, dashed edges. Then, we
run each variation of Simrank on the remaining graph and record the similarity scores $\mathrm{sim}(q_1, q_2)$ and $\mathrm{sim}(q_1, q_3)$
that the method gives. Finally, we test whether the ordering of $q_2$, $q_3$ that these similarity scores provide is
consistent with the ordering derived from the desirability scores. In our example, if $\mathrm{des}(q_1, q_2) > \mathrm{des}(q_1, q_3)$ and
$\mathrm{sim}(q_1, q_2) > \mathrm{sim}(q_1, q_3)$, then we would say that the similarity score was successful in predicting the desirable
rewrite.

We repeated this edge removal experiment for 50 queries randomly selected from our five-subgraphs dataset.
These queries played the role of query $q_1$ as described above. For each of those queries we identified all the queries
in the dataset that shared at least one common ad with it, and we randomly selected two of them. Those were
the $q_2$ and $q_3$ queries. To make sure that a Simrank similarity score could be computed after the deletion of
the edges in our experiment, we selected the queries $q_2$, $q_3$ only after checking that, after edge removal, there would
still exist a path from $q_2$ to $q_1$ and from $q_3$ to $q_1$ through other edges in the graph. Since the Pearson correlation
can be used only when two queries have at least one ad in common, we did not include that technique
in this part of our evaluation.
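The two mechanical steps of this experiment, pruning the graph and checking the ordering, can be sketched as below. The function names and the dictionary layout are hypothetical, the desirability and similarity scores are assumed to be computed elsewhere and passed in, and ties are handled in the simplest possible way (an assumption, since the text does not discuss them).

```python
def remove_shared_edges(graph, q1, q2, q3):
    """Return a copy of graph without the edges from q1 to ads shared with q2 or q3.

    graph maps each query to a dict {ad: weight}; the input is left untouched.
    """
    shared = set(graph[q1]) & (set(graph[q2]) | set(graph[q3]))
    pruned = {query: dict(ads) for query, ads in graph.items()}
    for ad in shared:
        del pruned[q1][ad]
    return pruned


def same_ordering(des_q2, des_q3, sim_q2, sim_q3):
    """True when the similarity scores rank q2 and q3 as the desirability scores do.

    Assumption: a tie counts as "q2 is not preferred" on either side.
    """
    return (des_q2 > des_q3) == (sim_q2 > sim_q3)
```

A method's score in the experiment would then be the fraction of sampled triples for which `same_ordering` returns `True`.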

9.4 Metrics The evaluation metrics we used were the following four: 

(i) Precision/recall: We consider two IR tasks. First, we interpret the rewrites with scores 1-2 as relevant
queries and the rewrites with scores 3-4 as irrelevant queries. Second, we interpret as relevant query
rewrites only the ones with score 1 and the rest as irrelevant. Thus, we can define the precision/recall of
method $m$ for query $q$ as follows:

$$\mathrm{precision}(q, m) = \frac{\text{relevant rewrites of } q \text{ that } m \text{ provides}}{\text{number of rewrites for } q \text{ that } m \text{ provides}}$$

$$\mathrm{recall}(q, m) = \frac{\text{relevant rewrites of } q \text{ that } m \text{ provides}}{\text{number of relevant rewrites for } q \text{ among all methods}}$$



(ii) Query coverage: We are also interested in the absolute number of queries (from our 120-query sample)
for which each method manages to provide at least one rewrite. We call this number query coverage. In
general, we prefer methods that cover as many queries as possible.

(iii) Query rewriting depth: Here, we are interested in the total number of query rewrites that a method provides
for a given query. This is called the depth of a query rewriting technique. Again, we prefer methods
that have larger rewriting depth.

(iv) Desirability prediction: For our desirability experiment, we report the fraction of the 50 queries for which
a method was able to correctly predict the desirability of $q_2$ (or $q_3$) over the other query.
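The first two metrics can be sketched in Python as follows. The function names and the data layout are hypothetical, the example judgments used to exercise them are made up, and returning 0 for an empty list is an assumption of the sketch.

```python
def precision_recall(judged, method, relevant_grades=(1, 2)):
    """Precision and recall of one method for a single query.

    judged maps a method name to {rewrite: editorial grade} for that query.
    Relevant rewrites are pooled over all methods to form the recall base.
    """
    mine = judged[method]
    relevant_mine = {r for r, grade in mine.items() if grade in relevant_grades}
    pooled = {
        r
        for grades in judged.values()
        for r, grade in grades.items()
        if grade in relevant_grades
    }
    # Assumption: an empty numerator base yields 0 rather than an error.
    precision = len(relevant_mine) / len(mine) if mine else 0.0
    recall = len(relevant_mine) / len(pooled) if pooled else 0.0
    return precision, recall


def query_coverage(rewrites_per_query):
    """Number of queries for which a method returned at least one rewrite."""
    return sum(1 for rewrites in rewrites_per_query.values() if rewrites)
```

Passing `relevant_grades=(1,)` switches to the stricter task in which only score 1 counts as relevant.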



10 Results 




Figure 8: Comparing the query coverage of Pearson and Simrank 



10.1 Query Coverage Figure 8 illustrates the percentage of queries from the 120-query sample that Pearson
and Simrank provide rewrites for. Simrank provides rewrites for almost all queries (98%), whereas Pearson gives
rewrites for only 41% of the queries. This is expected, since Pearson can only measure
similarity between two queries if they share a common ad, whereas Simrank takes into account the whole graph
structure and has no such requirement. Also notice that evidence-based Simrank further improves the
coverage to 99%.

10.2 Precision-Recall Figure 9 presents the precision/recall graphs for Pearson and Simrank as well as the
precision at 1-5 queries (P@X). For the computation of precision and recall, the editorial scores were used in a
binary classification manner: scores 1-2 were the positive class and scores 3-4 the negative class. For instance, in
Figure 9 (bottom graph) we see that weighted Simrank has 93% precision for 2 rewrites, meaning that 93% of its
rewrites in the top two ranks were given scores of 1 or 2 by the evaluators. Figure 10 presents further precision/recall
graphs for Pearson and Simrank as well as precision at 1-5 queries (P@X). There, however, the positive class of
the binary classification problem consists of the editorial score 1, whereas the negative class contains the editorial
scores 2-4.

In both cases we see that simple Simrank substantially improves the precision of the rewrites compared to 
Pearson. In addition, the use of the evidence score and the exploitation of the graph weights further boosts the 
precision, as expected. 
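For concreteness, P@X can be computed along the following lines. This sketch pools the top-X rewrites of all queries before taking the ratio, which matches the reading "93% of its rewrites in the top two ranks"; whether the reported numbers were pooled or averaged per query is an assumption here, and the function name and input layout are hypothetical.

```python
def precision_at(ranked_grades_per_query, x, relevant_grades=(1, 2)):
    """Fraction of relevant rewrites among the top-x rewrites of all queries.

    ranked_grades_per_query maps a query to the editorial grades of its
    rewrites, ordered from the highest-ranked rewrite to the lowest.
    """
    top = [
        grade
        for grades in ranked_grades_per_query.values()
        for grade in grades[:x]
    ]
    if not top:
        return 0.0  # assumption: no rewrites at all counts as precision 0
    return sum(grade in relevant_grades for grade in top) / len(top)
```

Queries with fewer than X rewrites simply contribute fewer grades to the pool.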
[Figure 9, top panel: "Precision-recall graphs" (x-axis: recall). Bottom panel: "Precision after X query rewrites (P@X)" (x-axis: # of query rewrites). Curves in both panels: Simrank, Pearson, evidence-based Simrank, weighted Simrank.]

Figure 9: Precision at 11 standard recall levels (top) and precision after X = 1, 2, ..., 5 query rewrites (P@X)
(bottom) using as positive class rewrites with score {1-2} and negative class rewrites with score {3-4}
[Figure 10, top panel: "Precision-recall graphs with threshold 1". Bottom panel x-axis: # of query rewrites.]

Figure 10: Precision at 11 standard recall levels (top) and precision after X = 1, 2, ..., 5 query rewrites (P@X)
(bottom) using as positive class rewrites with score 1 and negative class rewrites with score {2-4}
10.3 Rewriting Depth Figure 11 compares the rewriting depth of Pearson and the variations of Simrank.
Note that our two enhanced schemes can provide the full 5 rewrites for over 85% of the queries. As mentioned
earlier, the more rewrites we can generate, the more options the back-end will have for finding ads with active
bids.

[Figure 11 plot: "Comparing the rewriting depth of Pearson and variations of Simrank" (x-axis: # of rewrites, with bins 5, 4-5, 3-5, 2-5, 1-5).]

Figure 11: Comparing the rewriting depth of Pearson and Simrank



[Figure 12 plot: "Evaluating correctness of order prediction" (bars for Simrank, evidence-based Simrank, weighted Simrank).]

Figure 12: Comparing the ability of query rewriting methods to correctly predict the order of query rewrite
candidates



10.4 Desirability prediction Figure 12 provides the results of our experiments for identifying the correct
order of query rewrites as described in Section 9.3. Simple Simrank and evidence-based Simrank each
successfully predict the desirable rewrite for 27 out of the 50 queries (54%). Note that neither method
exploits the graph weights in the similarity computations; both rely only on the graph structure. Weighted Simrank
correctly predicts the desirable rewrite for 46 queries (92%).
10.5 Discussion As we can see, simple Simrank outperforms Pearson in query coverage, rewriting depth,
and precision/recall. Notice that this version of Simrank does not use the qualitative information
in the click graph at all, whereas Pearson does.

The introduction of evidence scores increases query coverage slightly (by 1%) and substantially improves the
quality of the rewrites. For instance, the precision at 5 rewrites of simple Simrank is 75%, whereas the precision
at 5 rewrites of evidence-based Simrank is 80% (Figure 9). In addition, in the P@X diagram (Figure
9) the line corresponding to the precision of evidence-based Simrank is always above the one corresponding to the
precision of simple Simrank. Finally, evidence-based Simrank increases the rewriting depth. For example, simple
Simrank provides five rewrites for 79% of the queries, whereas evidence-based Simrank gives five rewrites for
89% of the queries (Figure 11).

Weighted Simrank builds upon evidence-based Simrank and uses the graph weights. It maintains the
query coverage of evidence-based Simrank at 99% (Figure 8) and substantially improves the quality
of the rewrites. Figure 9 shows that the P@X line of weighted Simrank is always above that of evidence-based
Simrank. The precision at 5 rewrites rises from 80% (evidence-based Simrank) to 86% (weighted Simrank). Also,
96% of the queries have a high-quality top rewrite when we use weighted Simrank (P@1, Figure 9), whereas the
corresponding percentages for evidence-based Simrank, simple Simrank and Pearson are 81%, 80% and 70%. In
our desirability experiment, weighted Simrank successfully predicted the desirable rewrite in 92% of the cases
(Figure 12). Finally, weighted Simrank maintains the rewriting depth of evidence-based Simrank (Figure 11).

11 Conclusions 

In this paper we focused on the problem of query rewriting for sponsored search. We proposed Simrank to exploit
the click graph structure and we introduced two extensions: one that takes into account the weights of the edges
in the click graph, and another that takes into account the "evidence" supporting the similarity between queries.
Our experimental results show that weighted Simrank is the overall best method for generating rewrites
based on a click graph.

There are several query rewriting issues that we did not address in our analysis. Spam clicks can mislead our 
techniques and thus spam-resistant variations of our techniques would be useful. Also, methods for combining 
our similarity scores with semantic text-based similarities could be considered. 

Even though our new schemes were developed and tested for query rewriting based on a click graph, we
suspect that the weighted and evidence-based Simrank methods could be of use in other applications that exploit
bipartite graphs. We plan to experiment with these schemes in other domains, including collaborative filtering.
A Simrank similarity scores on complete bipartite graphs

Theorem A.1. Consider the complete bipartite graph $K_{2,2}$ with node sets $V_1 = \{a, b\}$ and $V_2 = \{A, B\}$. Let
$\mathrm{sim}^{(k)}(A, B)$ denote the similarity between nodes $A$, $B$ that bipartite Simrank computes after $k$ iterations and let
$C_1$, $C_2$ denote the Simrank decay factors. Then:

(i) $\mathrm{sim}^{(k)}(A, B) = \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1}$

(ii) $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) \le C_2$

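Proof. (i) We follow the computation of the Simrank similarity scores from Equations 4.1 and 4.2.

• Iteration 1:

$$\mathrm{sim}^{(1)}(A, B) = \frac{C_2}{4}(1 + 1) = \frac{C_2}{2} = \frac{C_2}{2} \sum_{i=1}^{1} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1}$$

$$\mathrm{sim}^{(1)}(a, b) = \frac{C_1}{4}(1 + 1) = \frac{C_1}{2}$$

• Iteration 2:

$$\mathrm{sim}^{(2)}(A, B) = \frac{C_2}{4}\left(1 + 1 + \frac{C_1}{2} + \frac{C_1}{2}\right) = \frac{C_2}{2} + \frac{C_1 \cdot C_2}{4} = \frac{C_2}{2}\left(1 + \frac{C_1}{2}\right) = \frac{C_2}{2} \sum_{i=1}^{2} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1}$$

$$\mathrm{sim}^{(2)}(a, b) = \frac{C_1}{4}\left(1 + 1 + \frac{C_2}{2} + \frac{C_2}{2}\right) = \frac{C_1}{2} + \frac{C_1 \cdot C_2}{4} = \frac{C_1}{2}\left(1 + \frac{C_2}{2}\right)$$

• Iteration 3:

$$\mathrm{sim}^{(3)}(A, B) = \frac{C_2}{4}\left(1 + 1 + \left(\frac{C_1}{2} + \frac{C_1 \cdot C_2}{4}\right) + \left(\frac{C_1}{2} + \frac{C_1 \cdot C_2}{4}\right)\right) = \frac{C_2}{2} + \frac{C_1 \cdot C_2}{4} + \frac{C_1 \cdot C_2^2}{8}$$

$$\mathrm{sim}^{(3)}(a, b) = \frac{C_1}{4}\left(1 + 1 + \left(\frac{C_2}{2} + \frac{C_1 \cdot C_2}{4}\right) + \left(\frac{C_2}{2} + \frac{C_1 \cdot C_2}{4}\right)\right) = \frac{C_1}{2} + \frac{C_1 \cdot C_2}{4} + \frac{C_1^2 \cdot C_2}{8}$$

• Iteration 4:

$$\mathrm{sim}^{(4)}(A, B) = \frac{C_2}{2} + \frac{C_1 \cdot C_2}{4} + \frac{C_1 \cdot C_2^2}{8} + \frac{C_1^2 \cdot C_2^2}{16}$$

• Iteration 5:

$$\mathrm{sim}^{(5)}(A, B) = \frac{C_2}{2} + \frac{C_1 \cdot C_2}{4} + \frac{C_1 \cdot C_2^2}{8} + \frac{C_1^2 \cdot C_2^2}{16} + \frac{C_1^2 \cdot C_2^3}{32}$$

• Iteration 6:

$$\mathrm{sim}^{(6)}(A, B) = \frac{C_2}{2} + \frac{C_1 \cdot C_2}{4} + \frac{C_1 \cdot C_2^2}{8} + \frac{C_1^2 \cdot C_2^2}{16} + \frac{C_1^2 \cdot C_2^3}{32} + \frac{C_1^3 \cdot C_2^3}{64}$$

We can easily observe that $\mathrm{sim}^{(k)}(A, B) = \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1}$.

(ii) We know that $C_1 \le 1$ and $C_2 \le 1$. Thus:

$$\frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1} \le \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}}$$

Now we can write:

$$\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) = \lim_{k \to \infty} \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1} \le \lim_{k \to \infty} \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} = \frac{C_2}{2} \cdot 2 = C_2$$

Thus, $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) \le C_2$. ∎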
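The closed form of Theorem A.1 can be checked numerically. The sketch below iterates the two coupled updates used in the proof on $K_{2,2}$ and compares the result with the sum in part (i); the function names are hypothetical, and the check is a sanity test for a few values, not a substitute for the proof.

```python
def simrank_k22(k, c1, c2):
    """Iterate bipartite Simrank on K_{2,2} and return sim^(k)(A, B)."""
    s_ab = 0.0  # sim(a, b): nodes of V1, decay factor C1
    s_AB = 0.0  # sim(A, B): nodes of V2, decay factor C2
    for _ in range(k):
        # Each node is adjacent to both nodes on the other side, so the double
        # sum has four terms: two self-pairs (score 1) and two cross pairs.
        s_ab, s_AB = c1 / 4 * (2 + 2 * s_AB), c2 / 4 * (2 + 2 * s_ab)
    return s_AB


def closed_form(k, c1, c2):
    """Right-hand side of Theorem A.1 (i)."""
    return c2 / 2 * sum(
        # floor(i/2) is i // 2 and ceil(i/2) is (i + 1) // 2 for positive i.
        c1 ** (i // 2) * c2 ** ((i + 1) // 2 - 1) / 2 ** (i - 1)
        for i in range(1, k + 1)
    )
```

For example, with `c1 = 0.8` and `c2 = 0.6` the two functions agree for every `k`, and the values stay below `c2`, in line with part (ii).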

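Theorem A.2. Consider the two complete bipartite graphs $G = K_{1,2}$ and $G' = K_{2,2}$ with node sets $V_1 = \{a\}$,
$V_2 = \{A, B\}$ and $V_1' = \{b, c\}$, $V_2' = \{C, D\}$ correspondingly. Let $\mathrm{sim}^{(k)}(A, B)$ and $\mathrm{sim}^{(k)}(C, D)$ denote the
similarity scores that bipartite Simrank computes for the node pairs $(A, B)$ and $(C, D)$ after $k$ iterations. Then,
$\mathrm{sim}^{(k)}(A, B) \ge \mathrm{sim}^{(k)}(C, D)$, $\forall k > 0$.

Proof. From Equations 4.1 and 4.2 we have:

$$\mathrm{sim}^{(k)}(A, B) = \frac{C_2}{1 \cdot 1} \cdot 1 = C_2, \quad \forall k > 0$$

Also, from Theorem A.1 (i), we have:

$$\mathrm{sim}^{(k)}(C, D) = \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1} \le \lim_{k \to \infty} \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} = \frac{C_2}{2} \cdot 2 = C_2$$

Thus, $\mathrm{sim}^{(k)}(A, B) \ge \mathrm{sim}^{(k)}(C, D)$, $\forall k > 0$. ∎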
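Theorem A.3. Consider the two complete bipartite graphs $G = K_{1,2}$ and $G' = K_{2,2}$ with node sets $V_1 = \{a\}$,
$V_2 = \{A, B\}$ and $V_1' = \{b, c\}$, $V_2' = \{C, D\}$ correspondingly. Let $\mathrm{sim}^{(k)}(A, B)$ and $\mathrm{sim}^{(k)}(C, D)$ denote the
similarity scores that bipartite Simrank computes for the node pairs $(A, B)$ and $(C, D)$ after $k$ iterations. Then,
$\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) = \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D)$ if and only if $C_1 = C_2 = 1$, where $C_1$, $C_2$ are the decay factors of
the bipartite Simrank equations.

Proof. Let us assume that $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) = \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D)$.

That means that:

$$\begin{aligned}
\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) = \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D) &\iff \\
C_2 = \lim_{k \to \infty} \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1} &\iff \\
1 = \frac{1}{2} \lim_{k \to \infty} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1} &\iff \\
\lim_{k \to \infty} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1} = 2 &\iff \\
C_1 = C_2 = 1
\end{aligned}$$

Now, let us assume that $C_1 = C_2 = 1$. We will have:

$$\lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D) = \lim_{k \to \infty} \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1} = \lim_{k \to \infty} \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} = C_2 = \lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B)$$

Thus, $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) = \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D)$ if and only if $C_1 = C_2 = 1$. ∎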

Corollary A.1. Consider the two complete bipartite graphs $G = K_{1,2}$ and $G' = K_{2,2}$ with node sets
$V_1 = \{a\}$, $V_2 = \{A, B\}$ and $V_1' = \{b, c\}$, $V_2' = \{C, D\}$ correspondingly. Let $\mathrm{sim}^{(k)}(A, B)$ and $\mathrm{sim}^{(k)}(C, D)$
denote the similarity scores that bipartite Simrank computes for the node pairs $(A, B)$ and $(C, D)$ after $k$ iterations.
Then, if for the decay factors of bipartite Simrank $C_1$, $C_2$ we know that $C_1 < 1$ or $C_2 < 1$, both of the following
are true:

(i) $\mathrm{sim}^{(k)}(A, B) > \mathrm{sim}^{(k)}(C, D)$, $\forall k > 0$, and

(ii) $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) > \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D)$

Proof. (i) It follows directly from Theorems A.2 and A.3.

(ii) We have:

$$\lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D) = \lim_{k \to \infty} \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1} < \lim_{k \to \infty} \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} = \frac{C_2}{2} \cdot 2 = C_2 = \lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B)$$

Thus, $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) > \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D)$. ∎

Theorem A.4. Consider the two complete bipartite graphs $G = K_{m,2}$ and $G' = K_{n,2}$ with $m < n$ and node sets
$V_1$, $V_2 = \{A, B\}$ and $V_1'$, $V_2' = \{C, D\}$ correspondingly. Let $\mathrm{sim}^{(k)}(A, B)$ and $\mathrm{sim}^{(k)}(C, D)$ denote the similarity
scores that bipartite Simrank computes for the node pairs $(A, B)$ and $(C, D)$ after $k$ iterations. Then,

(i) $\mathrm{sim}^{(k)}(A, B) \ge \mathrm{sim}^{(k)}(C, D)$, $\forall k > 0$, and

(ii) $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) = \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D)$ if and only if $C_1 = C_2 = 1$, where $C_1$, $C_2$ are the decay
factors of the bipartite Simrank equations.

Proof. Similar arguments as in Theorem A.1. ∎
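The ordering claimed in Theorems A.2 and A.4 can also be observed numerically with a small, general implementation of bipartite Simrank. This is a sketch under assumptions: the function name and graph layout are hypothetical, scores are normalized by the product of the two neighborhood sizes as in the iterations of Theorem A.1, and the check covers only a few graph sizes and decay factors.

```python
from itertools import product


def bipartite_simrank(neighbors_v1, k, c1, c2):
    """Run k iterations of bipartite Simrank.

    neighbors_v1 maps each V1 node to the set of its V2 neighbors.
    Returns (scores on V1 pairs, scores on V2 pairs) as dictionaries.
    """
    neighbors_v2 = {}
    for u, ads in neighbors_v1.items():
        for i in ads:
            neighbors_v2.setdefault(i, set()).add(u)
    v1, v2 = list(neighbors_v1), list(neighbors_v2)

    def identity(nodes):
        return {(x, y): 1.0 if x == y else 0.0 for x, y in product(nodes, repeat=2)}

    def step(nodes, nbrs, other, decay):
        new = {}
        for x, y in product(nodes, repeat=2):
            if x == y:
                new[(x, y)] = 1.0
                continue
            total = sum(other[(i, j)] for i in nbrs[x] for j in nbrs[y])
            new[(x, y)] = decay * total / (len(nbrs[x]) * len(nbrs[y]))
        return new

    s1, s2 = identity(v1), identity(v2)
    for _ in range(k):
        # Both sides are updated from the scores of the previous iteration.
        s1, s2 = step(v1, neighbors_v1, s2, c1), step(v2, neighbors_v2, s1, c2)
    return s1, s2
```

Running it on $K_{1,2}$, $K_{2,2}$ and $K_{3,2}$ with the same decay factors gives scores for the pair on the two-node side that do not increase as the other side grows.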
B Evidence-based Simrank similarity scores on complete bipartite graphs

Theorem B.1. Consider the complete bipartite graph $K_{2,2}$ with node sets $V_1 = \{a, b\}$ and $V_2 = \{A, B\}$. Let
$\mathrm{sim}^{(k)}(A, B)$ denote the similarity between nodes $A$, $B$ that evidence-based bipartite Simrank computes after $k$
iterations and let $C_1$, $C_2$ denote the Simrank decay factors. Then:

(i) $\mathrm{sim}^{(k)}(A, B) = \left(\frac{1}{2} + \frac{1}{3}\right) \cdot \frac{C_2}{2} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1}$

(ii) If $C_1, C_2 > \frac{1}{2}$, then $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) > \frac{C_2}{2}$

Proof. (i) It follows directly from the definition of evidence-based Simrank (Equations 7.5 and 7.6) and Theorem A.1.

(ii) We have:

$$\begin{aligned}
\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) &= 0.41666 \cdot C_2 \lim_{k \to \infty} \sum_{i=1}^{k} \frac{1}{2^{i-1}} C_1^{\lfloor i/2 \rfloor} C_2^{\lceil i/2 \rceil - 1} \\
&> 0.41666 \cdot C_2 \lim_{k \to \infty} \sum_{i=1}^{k} \frac{1}{2^{i-1}} \cdot \frac{1}{2^{i-1}} \\
&= 0.41666 \cdot C_2 \cdot \frac{4}{3} \\
&= 0.5555 \cdot C_2 > \frac{C_2}{2}
\end{aligned}$$

∎

Theorem B.2. Consider the two complete bipartite graphs $G = K_{1,2}$ and $G' = K_{2,2}$ with node sets $V_1 = \{a\}$,
$V_2 = \{A, B\}$ and $V_1' = \{b, c\}$, $V_2' = \{C, D\}$ correspondingly. Let $\mathrm{sim}^{(k)}(A, B)$ and $\mathrm{sim}^{(k)}(C, D)$ denote
the similarity scores that bipartite evidence-based Simrank computes for the node pairs $(A, B)$ and $(C, D)$ after $k$
iterations. Then, if for the decay factors of bipartite Simrank $C_1$, $C_2$ we know that $C_1, C_2 > \frac{1}{2}$, we have

(i) $\mathrm{sim}^{(k)}(A, B) < \mathrm{sim}^{(k)}(C, D)$, $\forall k > 1$, and

(ii) $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) < \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D)$

Proof. (i) It follows directly from the definition of evidence-based Simrank (Equations 7.5 and 7.6) and Theorem B.1.

(ii) From Theorem B.1 we have:

$$\lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D) > \frac{C_2}{2}$$

Also, from the definition of evidence-based Simrank (Equations 7.5 and 7.6) we have:

$$\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) = \frac{C_2}{2}$$

Thus, $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) < \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D)$. ∎

Theorem B.3. Consider the two complete bipartite graphs $G = K_{m,2}$ and $G' = K_{n,2}$ with $m < n$ and node sets
$V_1$, $V_2 = \{A, B\}$ and $V_1'$, $V_2' = \{C, D\}$ correspondingly. Let $\mathrm{sim}^{(k)}(A, B)$ and $\mathrm{sim}^{(k)}(C, D)$ denote the similarity
scores that bipartite evidence-based Simrank computes for the node pairs $(A, B)$ and $(C, D)$ after $k$ iterations and
let $C_1, C_2 > \frac{1}{2}$, where $C_1$, $C_2$ are the decay factors of the bipartite Simrank equations. Then,

(i) $\mathrm{sim}^{(k)}(A, B) < \mathrm{sim}^{(k)}(C, D)$, $\forall k > 1$, and

(ii) $\lim_{k \to \infty} \mathrm{sim}^{(k)}(A, B) < \lim_{k \to \infty} \mathrm{sim}^{(k)}(C, D)$

Proof. Similar arguments as in Theorem B.2. ∎


